NONTECHNICAL SUMMARY


A Markovian Decision Process is a process which is observed at
distinct time points to be in some state. After observing the state
of the system an action is chosen; corresponding to the action (and
the present state) a cost is incurred and the transition probabilities
for the next state are determined. A policy is any rule for choosing
actions. Corresponding to each policy there is an expected long run
average cost per unit time. This paper is concerned with finding
an optimal policy - i.e. one whose associated average cost is minimal.

For example we might have a tool which wears out with time.
The state of the system could be the length of the tool, and the
possible actions could be either to replace the tool or not. Associated
with each state there would be an operating cost. Thus a policy is a
rule for determining when to replace the tool and an optimal one is
one which minimizes the long run average cost.

In the past most of the work in this area has been done under the
assumption that the state space is countable. In this paper we let
the state space be arbitrary. For example, in the tool problem given
above it is natural to let the state space be the continuum of possible
values of the length of the tool.

This paper presents sufficient conditions for the existence of an
optimal policy and for it to be of simple type. This type - called
stationary deterministic - is of the form of a function mapping the
state space into the action space. For example, in the tool problem
a stationary deterministic policy would replace whenever the length
of the tool is in some specified set of real numbers. The method
employed is to treat the average cost problem as a limit of either
the discounted cost problem or the nondiscounted n-stage problem.

We also show how, in a special case, the average cost problem may
be reduced to a discounted cost problem.


ARBITRARY  STATE  MARKOVIAN  DECISION  PROCESSES 


Sheldon  M.  Ross 

1.  Introduction 

We are concerned with a process which is observed at times
$t = 0,1,2,\ldots$ and classified into one of a possible number of states.
We let $\chi$ denote the state space of the process. $\chi$ is assumed to be
a Borel subset of a complete separable metric space, and we let $\mathcal{B}$ be
the $\sigma$-algebra of Borel subsets of $\chi$. After each classification an
action must be chosen and we let $A$, assumed finite, denote the set
of all possible actions.

Let $\{X_t;\ t = 0,1,2,\ldots\}$ and $\{A_t;\ t = 0,1,2,\ldots\}$ denote the
sequence of states and actions; and let
$S_{t-1} = (X_0, A_0, \ldots, X_{t-1}, A_{t-1})$.

It is assumed that for every $x \in \chi$, $k \in A$ there is a known probability
measure $P(\cdot \mid x,k)$ on $\mathcal{B}$ such that, for some version,

$P\{X_{t+1} \in B \mid X_t = x,\ A_t = k,\ S_{t-1}\} = P(B \mid x,k)$

for every $B \in \mathcal{B}$ and all histories $S_{t-1}$. It is also assumed that
for every $k \in A$, $B \in \mathcal{B}$, $P(B \mid \cdot\,,k)$ is a Baire function on $\chi$.

Whenever the process is in state $x$ and action $k$ is chosen then
a bounded (expected) cost $C(x,k)$ - assumed, for fixed $k$, to be a Baire
function in $x$ - is incurred.

A policy $R$ is a set of Baire functions $\{D_k(S_{t-1},x)\}_{k \in A}$ satisfying
$D_k(S_{t-1},x) \geq 0$ for all $k \in A$, and $\sum_{k \in A} D_k(S_{t-1},x) = 1$ for every
$(S_{t-1},x)$.

The interpretation being: if at time $t$ the history $S_{t-1}$ has been
observed and $X_t = x$ then action $k$ is chosen with probability $D_k(S_{t-1},x)$.

$R$ is said to be stationary if $D_k(S_{t-1},x) = D_k(x)$ for every $S_{t-1}$. $R$ is
said to be stationary deterministic if $D_k(x)$ equals 0 or 1 for all $x,k$.
For any policy $R$, let

$\phi(x,R) = \limsup_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n} E_R[C(X_t,A_t) \mid X_0 = x]$.

Thus $\phi(x,R)$ is the expected average cost per unit time when the process
starts in state $x$ and policy $R$ is used.

In [4], under the assumption that $\chi$ is denumerable, a number of
results dealing with the average cost criterion were proven. The method
employed was to treat the average cost problem as a limit (as the discount
factor approaches unity) of the discounted cost problem. In this paper
we generalize some of these results to arbitrary state spaces. We also
show how to treat the average cost problem as a limit of the n-stage
problem. One of the advantages of this approach is that it enables us
to determine, for denumerable $\chi$, a necessary and sufficient condition
for the existence of a bounded solution to a functional equation which
characterizes the optimal policy.

2.  Stationary  Deterministic  Optimal  Policies 

The following theorem was originally proven by Derman [2] for the
special case that $\chi$ is denumerable. The following proof is new; it
makes use of a technique used by Taylor [5].

Theorem 1: If there exists a bounded Baire function $f(x)$, $x \in \chi$, and a
constant $g$, such that

(1)  $g + f(x) = \min_{k \in A} \{ C(x,k) + \int_{y \in \chi} f(y)\, dP(y \mid x,k) \}, \quad x \in \chi$

then there exists a stationary deterministic policy $R^*$ such that

$g = \phi(x,R^*) = \min_{R} \phi(x,R)$ for all $x \in \chi$

and $R^*$ is any policy which, for each $x$, prescribes an action which
minimizes the right side of (1).
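
As an illustration only, Theorem 1 can be checked numerically on a finite model. The sketch below assumes a hypothetical three-state version of the tool problem (every cost and probability is invented, and a finite state space is far more special than the arbitrary $\chi$ treated here). The pair $(g, f)$ is found by relative value iteration, a standard numerical technique that is not taken from this paper, and the names `bracket`, `solve_equation_1` and `average_cost` are introduced for the example.

```python
# Hypothetical 3-state tool-wear model (all numbers invented for illustration).
# States: 0 = new, 1 = worn, 2 = badly worn. Actions: 0 = keep, 1 = replace.
STATES, ACTIONS = range(3), range(2)
C = [[1.0, 6.0], [3.0, 6.0], [8.0, 6.0]]                  # C[x][k]
P = [
    [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.1, 0.0, 0.9]],  # P[0][x][y]: keep
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # P[1][x][y]: replace
]


def bracket(f, x, k):
    """C(x,k) + sum_y f(y) P(y|x,k), the term minimized in (1)."""
    return C[x][k] + sum(P[k][x][y] * f[y] for y in STATES)


def solve_equation_1(iterations=500):
    """Relative value iteration, normalized so that f(0) = 0."""
    f = [0.0 for _ in STATES]
    for _ in range(iterations):
        t = [min(bracket(f, x, k) for k in ACTIONS) for x in STATES]
        f = [t[x] - t[0] for x in STATES]
    g = min(bracket(f, 0, k) for k in ACTIONS) - f[0]
    return g, f


def average_cost(policy, iterations=2000):
    """Average cost of a stationary deterministic policy (policy[x] = action)."""
    pi = [1.0, 0.0, 0.0]                                  # start in state 0
    for _ in range(iterations):                           # power iteration
        pi = [sum(pi[x] * P[policy[x]][x][y] for x in STATES) for y in STATES]
    return sum(pi[x] * C[x][policy[x]] for x in STATES)


g, f = solve_equation_1()
# How far (g, f) is from satisfying (1).
residual = max(
    abs(g + f[x] - min(bracket(f, x, k) for k in ACTIONS)) for x in STATES
)
# R*: in each state take an action minimizing the right side of (1).
r_star = [min(ACTIONS, key=lambda k: bracket(f, x, k)) for x in STATES]
# Brute-force comparison over all 8 stationary deterministic policies.
best_by_enumeration = min(
    average_cost([a, b, c]) for a in ACTIONS for b in ACTIONS for c in ACTIONS
)
print(g, f, r_star, residual, best_by_enumeration)
```

In this toy model every action reaches state 0 with probability at least 0.1, so the average cost of a stationary policy does not depend on the starting state and the power iteration in `average_cost` converges. The enumeration covers only stationary deterministic policies, so it checks a weaker statement than the theorem.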

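Proof: For any policy $R$,

$E_R \{ \sum_{t=1}^{n} [ f(X_t) - E_R(f(X_t) \mid S_{t-1}) ] \} = 0$

But

$E_R[f(X_t) \mid S_{t-1}] = \int_{y \in \chi} f(y)\, dP(y \mid X_{t-1},A_{t-1})$

$= C(X_{t-1},A_{t-1}) + \int_{y \in \chi} f(y)\, dP(y \mid X_{t-1},A_{t-1}) - C(X_{t-1},A_{t-1})$

$\geq \min_{k \in A} \{ C(X_{t-1},k) + \int_{y \in \chi} f(y)\, dP(y \mid X_{t-1},k) \} - C(X_{t-1},A_{t-1})$

$= g + f(X_{t-1}) - C(X_{t-1},A_{t-1})$

with equality for $R^*$ since $R^*$ is defined to take the minimizing action.
Hence

$0 \leq E_R \{ \sum_{t=1}^{n} [ f(X_t) - g - f(X_{t-1}) + C(X_{t-1},A_{t-1}) ] \}$

or

(2)  $g \leq E_R \frac{f(X_n)}{n} - E_R \frac{f(X_0)}{n} + E_R \frac{\sum_{t=1}^{n} C(X_{t-1},A_{t-1})}{n}$

with equality for $R^*$. Letting $n \to \infty$ and using the fact that $f$ is
bounded, we have that $g \leq \phi(X_0,R)$ with equality for $R^*$, and for all
possible values of $X_0$. QED.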

Remark: Note that the above proof doesn't make use of the fact that
$A$ is finite or that $C(x,k)$ is bounded.
Let $g_n(x)$, $n = 1,2,\ldots$ satisfy

(3)  $g_1(x) = \min_{k} C(x,k)$

     $g_{n+1}(x) = \min_{k} \{ C(x,k) + \int_{y \in \chi} g_n(y)\, dP(y \mid x,k) \}$

Note that $g_n(x) = \min_{R} \sum_{t=0}^{n-1} E_R[C(X_t,A_t) \mid X_0 = x]$. The following corollary
was proven by Derman [2] for the denumerable case.

Corollary 1: Under the conditions of Theorem 1, there is an $M$
such that

$|g_n(x) - ng| < M$ for all $n,x$

Proof: Let $M'$ be such that $|f(x)| < M'$. By (2) we have that
$ng \leq 2M' + g_n(x)$. Again from (2), by letting $R = R^*$ we have that
$ng \geq g_n(x) - 2M'$. QED.

Fix some state - call it 0 - and let

(4)  $f_n(x) = g_n(x) - g_n(0)$ for all $n,x$

One has from (3) that

(5)  $g_{n+1}(0) - g_n(0) + f_{n+1}(x) = \min_{k} \{ C(x,k) + \int_{y \in \chi} f_n(y)\, dP(y \mid x,k) \}$

We shall now determine sufficient (and in the denumerable case
necessary and sufficient) conditions for the existence of a bounded
Baire function $f(x)$ and a constant $g$ satisfying (1).

Theorem 2: If $\{f_n\}$ is a uniformly bounded equicontinuous family of
functions then

(i) there exists a bounded continuous function $f(x)$ and a constant
$g$ satisfying (1).

(ii) $\lim_{n \to \infty} (g_{n+1}(x) - g_n(x)) = g$ for all $x \in \chi$.

Proof: By the Ascoli Theorem there exists a subsequence $\{f_{n_k}\}$
and a continuous function $f$ such that $f_{n_k}(x) \to f(x)$. Now $g_{n+1}(0) - g_n(0)$
is bounded (since costs are bounded) and so we can also require that
$g_{n_k+1}(0) - g_{n_k}(0) \to g$. Hence by (5) and the bounded convergence theorem
we have that

$g + f(x) = \min_{k} \{ C(x,k) + \int_{y \in \chi} f(y)\, dP(y \mid x,k) \}$

For any subsequence $\{n'\}$ of $\{n\}$ there is a sub-subsequence $\{n''\}$
such that $\lim (g_{n''+1}(0) - g_{n''}(0))$ exists. By the above this limit must
be $g$. Thus $g = \lim_{n} (g_{n+1}(0) - g_n(0))$. The result follows since 0 is
an arbitrary state. QED.
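
The recursion (3), the normalization (4), and the conclusions of Corollary 1 and Theorem 2 (ii) can be observed numerically. The following is a minimal sketch on a hypothetical three-state tool model with invented numbers; the finite state space, the horizon of 400 stages, and the helper name `n_stage_costs` are assumptions made for the example.

```python
# Hypothetical 3-state tool-wear model (all numbers invented for illustration).
# States: 0 = new, 1 = worn, 2 = badly worn. Actions: 0 = keep, 1 = replace.
STATES, ACTIONS = range(3), range(2)
C = [[1.0, 6.0], [3.0, 6.0], [8.0, 6.0]]                  # C[x][k]
P = [
    [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.1, 0.0, 0.9]],  # P[0][x][y]: keep
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # P[1][x][y]: replace
]
N = 400


def n_stage_costs(n_max):
    """Return [g_1, ..., g_{n_max}] computed from recursion (3)."""
    g = [[min(C[x][k] for k in ACTIONS) for x in STATES]]  # g_1
    for _ in range(n_max - 1):
        prev = g[-1]
        g.append([
            min(
                C[x][k] + sum(P[k][x][y] * prev[y] for y in STATES)
                for k in ACTIONS
            )
            for x in STATES
        ])
    return g


g_seq = n_stage_costs(N)                                   # g_seq[n-1] is g_n
# Theorem 2 (ii): g_{n+1}(x) - g_n(x) should approach the same g for every x.
last_diff = [g_seq[-1][x] - g_seq[-2][x] for x in STATES]
g_est = last_diff[0]
# (4): f_n(x) = g_n(x) - g_n(0); the theorem assumes these stay bounded.
f_n = [[gn[x] - gn[0] for x in STATES] for gn in g_seq]
f_bound = max(abs(v) for fn in f_n for v in fn)
# Corollary 1: |g_n(x) - n g| should stay below a constant for all n, x.
drift = max(
    abs(g_seq[n - 1][x] - n * g_est) for n in range(1, N + 1) for x in STATES
)
print(last_diff, f_bound, drift)
```

Here `g_est` is only an estimate of $g$ taken from the last difference, so `drift` checks Corollary 1 up to the accuracy of that estimate.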

If $\chi$ is denumerable, then $\{f_n\}$ can always be taken to be equicontinuous
by considering the discrete topology. We thus have

Corollary 2: If $\chi$ is denumerable, then a necessary and sufficient condition
for the existence of a bounded function $f(x)$ and constant $g$ satisfying (1)
is that there is an $M < \infty$ such that $|g_n(x) - g_n(0)| < M$ for all $n,x$.

Proof: Sufficiency follows from the above theorem and necessity follows
from Corollary 1. QED.

For any policy $R$, $\beta \in (0,1)$, let

$\psi(x,\beta,R) = \sum_{t=0}^{\infty} \beta^t E_R[C(X_t,A_t) \mid X_0 = x]$

A policy $R_\beta$ such that $\psi(x,\beta,R_\beta) = \min_{R} \psi(x,\beta,R)$ for all $x \in \chi$ is said to be
$\beta$-optimal.

We shall need the following result given by Blackwell [1].
If $A$ is finite, and $C(\cdot,\cdot)$ is bounded then, for each $\beta \in (0,1)$, there is
a stationary deterministic policy which is $\beta$-optimal. Furthermore
$\psi(x,\beta,R_\beta)$ is the unique solution to

(6)  $\psi(x,\beta,R_\beta) = \min_{k \in A} \{ C(x,k) + \beta \int_{y \in \chi} \psi(y,\beta,R_\beta)\, dP(y \mid x,k) \}$

and any policy which, when in state $x$, selects an action which minimizes
the right side of (6) is $\beta$-optimal.

Fix some state - call it 0 - and let

(7)  $f_\beta(x) = \psi(x,\beta,R_\beta) - \psi(0,\beta,R_\beta)$

then

(8)  $g_\beta + f_\beta(x) = \min_{k} \{ C(x,k) + \beta \int_{y \in \chi} f_\beta(y)\, dP(y \mid x,k) \}$

where

$g_\beta = (1-\beta)\, \psi(0,\beta,R_\beta)$

In analogous fashion to Theorem 2 we have

Theorem 3: If $\{f_\beta\}$ is a uniformly bounded equicontinuous family of
functions then

(i) there exists a bounded continuous function $f(x)$ and a constant
$g$ satisfying (1).

(ii) $(1-\beta)\, \psi(x,\beta,R_\beta) \to g$ as $\beta \to 1^-$ for all $x \in \chi$.

Proof: Same as proof of Theorem 2.
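
The discounted quantities in (6), (7) and (8), and the limit in Theorem 3 (ii), can be illustrated in the same spirit. The sketch below assumes a hypothetical three-state tool model with invented numbers and solves (6) by plain value iteration; the function name `discounted_value`, the three discount factors, and the iteration count are choices made for the example, not part of the paper.

```python
# Hypothetical 3-state tool-wear model (all numbers invented for illustration).
# States: 0 = new, 1 = worn, 2 = badly worn. Actions: 0 = keep, 1 = replace.
STATES, ACTIONS = range(3), range(2)
C = [[1.0, 6.0], [3.0, 6.0], [8.0, 6.0]]                  # C[x][k]
P = [
    [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.1, 0.0, 0.9]],  # P[0][x][y]: keep
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # P[1][x][y]: replace
]


def backup(v, x, beta):
    """min_k { C(x,k) + beta * sum_y v(y) P(y|x,k) }."""
    return min(
        C[x][k] + beta * sum(P[k][x][y] * v[y] for y in STATES)
        for k in ACTIONS
    )


def discounted_value(beta):
    """Value iteration for (6); returns an approximation of psi(., beta, R_beta)."""
    psi = [0.0 for _ in STATES]
    for _ in range(int(40 / (1.0 - beta))):   # beta**n is then roughly e**-40
        psi = [backup(psi, x, beta) for x in STATES]
    return psi


results = {}
for beta in (0.9, 0.99, 0.999):
    psi = discounted_value(beta)
    g_beta = (1.0 - beta) * psi[0]                        # as defined after (8)
    f_beta = [psi[x] - psi[0] for x in STATES]            # (7)
    # Residual of (8): g_beta + f_beta(x) against the discounted backup of f_beta.
    residual_8 = max(
        abs(g_beta + f_beta[x] - backup(f_beta, x, beta)) for x in STATES
    )
    scaled = [(1.0 - beta) * v for v in psi]              # (1 - beta) * psi(x)
    results[beta] = (g_beta, f_beta, residual_8, scaled)
    print(beta, scaled, f_beta)
```

As `beta` increases toward 1 the entries of `scaled` move toward a common value while `f_beta` stays bounded, which is the behaviour Theorem 3 describes for this particular toy model.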

For any stationary deterministic policy $R$ let $x(R)$ be the action $R$
chooses when in state $x$. We say that $\lim_{n \to \infty} R_n = R$ if, for each $x$,
there exists $N_x < \infty$ such that $x(R_n) = x(R)$ for all $n \geq N_x$.

The following was proven in [4] for denumerable $\chi$. The proof
for arbitrary $\chi$ is identical.
Theorem 4: Under the conditions of Theorem 3

(i) for some sequence $\beta_r \to 1^-$, $R^* = \lim_{r \to \infty} R_{\beta_r}$

(ii) if $R = \lim_{r \to \infty} R_{\beta_r}$, where $\beta_r \to 1^-$, then $R$ is optimal - i.e.
$\phi(x,R) = g$ for all $x \in \chi$.

The following two conditions were given by Taylor [5] to prove
equicontinuity of $\{f_\beta\}$ in the special case of a replacement process:

(a) For every $k \in A$, $C(\cdot,k)$ is continuous.

(b) For every $x \in \chi$, $k \in A$, $P(\cdot \mid x,k)$ is absolutely continuous with respect
to some $\sigma$-finite measure $\mu$ on $\mathcal{B}$ and it possesses a density $p(y \mid x,k)$
also assumed to be a Baire function in $x$. Furthermore, for every
$x \in \chi$, $k \in A$

$\lim_{x' \to x} \int | p(y \mid x,k) - p(y \mid x',k) |\, d\mu(y) = 0$

Theorem 5: If conditions (a) and (b) are satisfied then

(i) $|f_\beta(x)| < M$ for all $x,\beta$ $\Rightarrow$ $\{f_\beta\}$ is equicontinuous

(ii) $|f_n(x)| < M$ for all $x,n$ $\Rightarrow$ $\{f_n\}$ is equicontinuous

Proof: Follows directly from (5) and (8) and conditions (a), (b).

A sufficient condition for the uniform boundedness of $\{f_\beta\}$ is
given in [4].

3.  Reduction  of  Average  Cost  Case  to  Discounted  Cost  Case 
We shall need the following assumption.

Assumption (I): There is a state - call it 0 - and an $\alpha > 0$, such that
$P\{X_{t+1} = 0 \mid X_t = x,\ A_t = k\} \geq \alpha$ for all $x \in \chi$, $k \in A$.
For any process satisfying the above Assumption consider a new
process with identical state and action spaces, with identical costs,
but with transition probabilities now given by

$P'(B \mid x,k) = \begin{cases} \dfrac{P(B \mid x,k)}{1-\alpha} & \text{for } 0 \notin B \\ \dfrac{P(B \mid x,k) - \alpha}{1-\alpha} & \text{for } 0 \in B \end{cases} \qquad B \in \mathcal{B}$


Let $\psi'(x,\beta,R)$ be the total expected $\beta$-discounted cost, and let
$R'_\beta$ be the $\beta$-optimal policy, all with respect to the new process.

Letting $f'(x) = \psi'(x,1-\alpha,R'_{1-\alpha}) - \psi'(0,1-\alpha,R'_{1-\alpha})$ we have by (8) that

(9)  $\alpha\, \psi'(0,1-\alpha,R'_{1-\alpha}) + f'(x) = \min_{k} \{ C(x,k) + (1-\alpha) \int_{y \in \chi} f'(y)\, dP'(y \mid x,k) \}$

     $= \min_{k} \{ C(x,k) + \int_{y \in \chi} f'(y)\, dP(y \mid x,k) \}$

And thus the conditions of Theorem 1 are satisfied. It follows that
$g = \alpha\, \psi'(0,1-\alpha,R'_{1-\alpha})$ and the optimal average-cost policy is the one
which selects the actions which minimize the right side of (9). But
it is easily seen that $R'_{1-\alpha}$ does exactly this. Hence the optimal
average cost policy is precisely the $(1-\alpha)$-optimal policy with respect
to the new process; and the optimal expected average cost per unit
time is $\alpha\, \psi'(0,1-\alpha,R'_{1-\alpha})$.
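
The reduction can be tried out on a small example. The sketch below assumes a hypothetical three-state tool model (invented numbers) in which every action reaches state 0 with probability at least 0.1, so Assumption (I) holds with $\alpha = 0.1$. It builds the new transition probabilities, solves the $(1-\alpha)$-discounted problem on the new process by value iteration, and then checks equation (1) for the original process. Names such as `P_new` and `backup_new` are introduced for the example.

```python
# Hypothetical 3-state tool-wear model (all numbers invented for illustration).
# States: 0 = new, 1 = worn, 2 = badly worn. Actions: 0 = keep, 1 = replace.
STATES, ACTIONS = range(3), range(2)
C = [[1.0, 6.0], [3.0, 6.0], [8.0, 6.0]]                  # C[x][k]
P = [
    [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.1, 0.0, 0.9]],  # P[0][x][y]: keep
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # P[1][x][y]: replace
]

ALPHA = 0.1
# Assumption (I) for this model: P(0 | x, k) >= ALPHA for every x and k.
assert all(P[k][x][0] >= ALPHA for k in ACTIONS for x in STATES)

# New process: take mass ALPHA away from state 0, then divide by 1 - ALPHA.
P_new = [
    [
        [(P[k][x][y] - (ALPHA if y == 0 else 0.0)) / (1.0 - ALPHA) for y in STATES]
        for x in STATES
    ]
    for k in ACTIONS
]
BETA = 1.0 - ALPHA


def backup_new(v, x, k):
    """C(x,k) + (1 - alpha) * sum_y v(y) P'(y|x,k) for the new process."""
    return C[x][k] + BETA * sum(P_new[k][x][y] * v[y] for y in STATES)


# Value iteration for the (1 - alpha)-discounted problem on the new process.
psi_new = [0.0 for _ in STATES]
for _ in range(2000):
    psi_new = [min(backup_new(psi_new, x, k) for k in ACTIONS) for x in STATES]

policy = [min(ACTIONS, key=lambda k: backup_new(psi_new, x, k)) for x in STATES]
g = ALPHA * psi_new[0]                                    # candidate average cost
f_prime = [psi_new[x] - psi_new[0] for x in STATES]

# Check (1) for the ORIGINAL transition probabilities P.
residual_1 = max(
    abs(
        g + f_prime[x]
        - min(
            C[x][k] + sum(P[k][x][y] * f_prime[y] for y in STATES)
            for k in ACTIONS
        )
    )
    for x in STATES
)
print(g, policy, residual_1)
```

A small `residual_1` means that $g$ and $f'$ computed from the discounted problem on the new process satisfy (1) for the original process, which is the step that lets Theorem 1 be applied.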

The above result was proven in [4] for the denumerable case by
showing that $\phi(x,R) = \alpha\, \psi'(0,1-\alpha,R)$ for any stationary policy $R$.
This result also holds for arbitrary $\chi$. However this in itself
does not show that $R'_{1-\alpha}$ is optimal. (It does in the denumerable
case because it can be shown that Assumption (I) implies that $\{f_\beta\}$
is uniformly bounded and thus by Theorem 3 there exists a
stationary deterministic policy which is optimal.)

4.  Concluding  Remarks 

Results given in [4] which dealt with $\epsilon$-optimal policies and
replacement processes (Sections 3 and 4) carry over to the more general
spaces $\chi$ considered here. The proofs are identical (with integrals
replacing sums in the obvious places).